Large Scale Citation Matching Using Apache Hadoop
Abstract
Citation matching is the process of creating links from bibliography entries to the publications they reference. Such links indicate topical similarity between the linked texts, are used to assess the impact of the referenced document, and improve navigation in the user interfaces of digital libraries. In this paper we present a citation matching method and show how to scale it to large amounts of data using appropriate indexing and the MapReduce paradigm in the Hadoop environment.
Similar resources
Taming the zoo - about algorithms implementation in the ecosystem of Apache Hadoop
Content Analysis System (CoAnSys) is a research framework for mining scientific publications using Apache Hadoop. This article describes the algorithms currently implemented in CoAnSys including classification, categorization and citation matching of scientific publications. The size of the input data classifies these algorithms in the range of big data problems, which can be efficiently solved...
Matching Dispute Finder Claims to Wikipedia Articles
Dealing with large datasets is increasingly becoming a problem for natural language processing researchers. For our class project we investigate applying the opensource Hadoop MapReduce framework to the problem of information retrieval using TFIDF.
ARPN Journal of Science and Technology::Analysis of Movie Lens Data Set using Hive
Large scale data set provides the better opportunity to find out much better data relationship in the area of business intelligence. In the paper, we implement our systems using Hadoop that has been popular to store and compute Big Data. However, it is not easy to write Hadoop Map Reduce code. Therefore, we use Hive and Hive QL codes to understand the relationships between ratings and the users...
Survey on Information Retrieval and Pattern Matching for Compressed Data Size using the SVD Technique on Real Audio Dataset
Due to increasing size of text and audio data over internet, various techniques are needed to help with the finding and extraction of very specific information relevant to a user's task. Text mining is a variant on a field called data mining that tries to discover curious patterns from large databases. Singular value decomposition this technique is used for dimensionality reduction of large dat...
Document Clustering Through Non-Negative Matrix Factorization: A Case Study of Hadoop for Computational Time Reduction of Large Scale Documents
In this paper we discuss a new model for document clustering which has been adapted using non-negative matrix factorization method. The key idea is to cluster the documents after measuring the proximity of the documents with the extracted features. The extracted features are considered as the final cluster labels and clustering is done using cosine similarity which is equivalent to k-means with...
Publication date: 2013